An Introduction to Graphical Methods

Exploring Baseball Data with R

Jim Albert and Claude

2026-02-26

What Is Data Visualization?

“A picture is worth a thousand numbers.”

We will learn to turn baseball statistics into pictures — and learn to read what those pictures are saying.

Our Data: 150 Years of Baseball

The Lahman package contains season-by-season statistics for every MLB player since 1871:

  • Over 100,000 player-seasons
  • Batting, pitching, fielding, salaries
  • Available free in R

We will focus on the Batting table

Each row = one player’s stats for one season (a player who changed teams mid-season gets one row per team)

Key statistics: AB (at-bats), H (hits), HR (home runs), SO (strikeouts), BB (walks)
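A first look at the table might go as follows. This is a minimal sketch, assuming the Lahman and dplyr packages are installed; the name `batting_small` is ours, chosen for illustration.

```r
library(Lahman)
library(dplyr)

# Keep the identifier columns plus the five statistics named above
batting_small <- Batting %>%
  select(playerID, yearID, AB, H, HR, SO, BB)

# First few rows, then the total number of rows
head(batting_small)
nrow(batting_small)
```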

Part 1: Histograms

What Does a Histogram Show?

A histogram shows how the values of one variable are spread out

  1. Divide the range into equal-width bins (buckets)
  2. Count how many observations fall in each bin
  3. Draw a bar — taller bars = more players in that range
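The three steps can be sketched in base R. The ten home run totals below are made-up values for illustration only, not real player data, and the 10-wide bins are an arbitrary choice.

```r
# Hypothetical home run totals for ten player-seasons (illustration only)
hr <- c(2, 5, 7, 8, 12, 14, 19, 23, 31, 44)

# Step 1: equal-width bin edges, each bin 10 home runs wide
edges <- seq(0, 50, by = 10)

# Step 2: count how many values fall in each bin
# right = FALSE means each bin includes its left edge, e.g. [0,10)
bins <- cut(hr, breaks = edges, right = FALSE)
counts <- table(bins)
counts

# Step 3: draw one bar per bin, with height equal to the count
barplot(counts, xlab = "Home runs", ylab = "Player-seasons")
```

In practice `hist()` or `geom_histogram()` performs all three steps in one call.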

Good for answering:

  • What is a typical value?
  • Are most values bunched together or spread out?
  • Are there any extreme outliers?

Home Runs: Most Players Hit Few

Reading the Home Run Histogram

Three things to notice:

1. Where is it tallest? Around 0–10 home runs — most regulars are not power hitters.

2. What is the shape? The bars fall off gradually to the right. This is called a right-skewed distribution.

3. How far does the tail reach? A very small number of players reach 50+ HRs. These are the elite power hitters.
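A histogram like the one described here could be drawn roughly as follows. The 400 at-bat cutoff for a "regular" and the bin width of 5 are our assumptions; the slides do not state which values were used.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumption: a "regular" is a player-season with at least 400 at-bats
regulars <- Batting %>%
  filter(AB >= 400)

# One bar per 5-home-run bin, with bins starting at 0
hr_hist <- ggplot(regulars, aes(x = HR)) +
  geom_histogram(binwidth = 5, boundary = 0,
                 fill = "steelblue", colour = "white") +
  labs(x = "Home runs in a season", y = "Number of player-seasons")

hr_hist
```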

Batting Average Looks Different

Two Very Different Shapes

Home runs (right-skewed): A few exceptional seasons stretch the tail far to the right. Most players are near zero; a few are exceptional.

Batting average (symmetric): Values spread out roughly equally above and below the average. A shape like this is called symmetric, and here it is roughly bell-shaped.

Small Multiples: Comparing Eras

Notice the thicker right tail in the 2000s panel — the steroid era fingerprint.
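Small multiples are one histogram per group, drawn on shared axes. A minimal sketch with `facet_wrap()`, assuming panels by decade from the 1980s to the 2010s and a 400 at-bat minimum (both are our choices, not necessarily those behind the slide):

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumptions: 1980-2019 only, at least 400 at-bats, grouped by decade
by_decade <- Batting %>%
  filter(AB >= 400, yearID >= 1980, yearID <= 2019) %>%
  mutate(decade = paste0(floor(yearID / 10) * 10, "s"))

# facet_wrap() draws one panel per decade with the same x and y scales
panels <- ggplot(by_decade, aes(x = HR)) +
  geom_histogram(binwidth = 5, boundary = 0) +
  facet_wrap(~ decade) +
  labs(x = "Home runs in a season", y = "Number of player-seasons")

panels
```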

Part 2: Density Plots

What Is a Density Plot?

A density plot is a smoothed histogram

  • Instead of bars, it draws a smooth curve
  • The height of the curve at any point shows how concentrated players are near that value (a relative measure, not a raw count)
  • The total area under the curve always equals 1
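The area claim can be checked numerically. The sketch below uses base R's `density()` on ten made-up home run totals (illustration only, not real data) and adds up the area with the trapezoid rule.

```r
# Hypothetical home run totals (illustration only)
hr <- c(2, 5, 7, 8, 12, 14, 19, 23, 31, 44)

# density() returns x positions and the curve height y at each one
d <- density(hr)

# Trapezoid rule: strip width times the average height of its two sides
widths <- diff(d$x)
heights <- (head(d$y, -1) + tail(d$y, -1)) / 2
area <- sum(widths * heights)
area

plot(d, main = "Density of home runs", xlab = "Home runs")
```

The result is very close to 1 rather than exactly 1, because the curve is evaluated on a finite grid.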

Why use it?

Density plots are better than histograms for comparing two groups — overlapping curves are much easier to read than overlapping bars.

Comparing Two Eras: Home Runs

Reading the Era Comparison

Steroid era (red):

Thicker far-right tail. A handful of players posted 50–70 HR seasons, including totals above the previous single-season record of 61.

Modern era (blue):

Higher in the 15–30 HR range — more players adopting power-oriented “launch-angle” swings, but fewer extreme outliers.
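An era comparison of this kind could be sketched with overlapping density curves. The year ranges below (1995 to 2005 for the steroid era, 2015 onward for the modern era) and the 400 at-bat minimum are our assumptions; the slides do not define the eras.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumed era boundaries; seasons outside both ranges are dropped
eras <- Batting %>%
  filter(AB >= 400) %>%
  mutate(era = case_when(
    yearID >= 1995 & yearID <= 2005 ~ "Steroid era",
    yearID >= 2015                  ~ "Modern era"
  )) %>%
  filter(!is.na(era))

era_colours <- c("Steroid era" = "red", "Modern era" = "blue")

# alpha makes the fills see-through so both curves stay visible
era_density <- ggplot(eras, aes(x = HR, colour = era, fill = era)) +
  geom_density(alpha = 0.3) +
  scale_colour_manual(values = era_colours) +
  scale_fill_manual(values = era_colours) +
  labs(x = "Home runs in a season", y = "Density")

era_density
```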

Batting Average Has Declined

The curves have the same shape but the modern era (dark blue) sits noticeably to the left — the statistical signature of today’s strikeout-heavy, power-first game.

Part 3: Scatterplots

What Does a Scatterplot Show?

A scatterplot shows the relationship between two variables

  • One dot per observation (one player-season)
  • X position = first variable
  • Y position = second variable
  • Patterns in the dot cloud reveal relationships
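A basic scatterplot takes only a few lines. This sketch assumes strikeouts on the x axis and home runs on the y axis, limited to seasons from 2000 on with at least 400 at-bats (our filter, chosen to keep the dot cloud readable).

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumptions: seasons from 2000 on, at least 400 at-bats
recent <- Batting %>%
  filter(yearID >= 2000, AB >= 400)

# One semi-transparent dot per player-season
scatter <- ggplot(recent, aes(x = SO, y = HR)) +
  geom_point(alpha = 0.3) +
  labs(x = "Strikeouts", y = "Home runs")

scatter
```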

Good for answering:

  • Do two variables move together?
  • Is the relationship strong or weak?
  • Are there any unusual outliers?

Do Power Hitters Strike Out More?

What the Dot Cloud Tells Us

Direction: The cloud trends upward from left to right: seasons with more strikeouts tend to have more home runs. Players who swing for the fences also miss more often.

Strength: The relationship is moderate — there is a clear trend but a lot of scatter. Strikeouts do not perfectly predict home runs.
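Direction and strength can be summarised in one number, the correlation coefficient, which runs from -1 to 1. A minimal sketch, assuming the same filter as before (seasons from 2000 on, at least 400 at-bats):

```r
library(Lahman)

# Assumptions: seasons from 2000 on, at least 400 at-bats
recent <- subset(Batting, yearID >= 2000 & AB >= 400)

# The sign gives the direction; the distance from 0 gives the strength
r_so_hr <- cor(recent$SO, recent$HR, use = "complete.obs")
r_so_hr
```

A value near 1 would mean the dots fall almost on a rising line; a value near 0 would mean no linear trend at all.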

Outliers: The grey box shows elite power hitters: high strikeout totals but also exceptional home run numbers.

Adding a Trend Line

Reading the Trend Line

Rising section: Through most of the data, the trend goes up — power and strikeouts rise together.

Flattening section: At very high strikeout totals (200+), the trend levels off. One reading is that too much swinging and missing becomes counterproductive even for power hitters, but very few seasons reach this range, so the curve is least certain here.

Shaded band: This is the 95% confidence interval — the range of plausible trend lines. Wider band = less certainty.
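A smoothed trend line with its band can be added with `geom_smooth()`. The sketch lets ggplot2 choose its default smoothing method; the method used for the slide is not stated, so the exact curve may differ.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumptions: seasons from 2000 on, at least 400 at-bats
recent <- Batting %>%
  filter(yearID >= 2000, AB >= 400)

# se = TRUE draws the shaded band; level = 0.95 makes it a 95% interval
trend <- ggplot(recent, aes(x = SO, y = HR)) +
  geom_point(alpha = 0.2) +
  geom_smooth(se = TRUE, level = 0.95) +
  labs(x = "Strikeouts", y = "Home runs")

trend
```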

Colouring by a Third Variable

What the Colours Reveal

Dark blue (walks often) clusters near the top of the plot — combining home run power with plate discipline.

These are the most dangerous hitters. Pitchers cannot simply avoid the strike zone (the batter will take a walk), so they must attack — and get punished when they make a mistake.

Red (rarely walks) spreads broadly — many free swingers strike out often without generating home runs in return.
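Colour can carry the third variable by mapping it inside `aes()`. In this sketch `walk_rate` is our own measure, walks divided by at-bats plus walks, and the red-to-dark-blue scale is chosen to match the description above; the slide may define walk frequency differently.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumptions: seasons from 2000 on, at least 400 at-bats
# walk_rate is a hypothetical measure: BB / (AB + BB)
recent <- Batting %>%
  filter(yearID >= 2000, AB >= 400) %>%
  mutate(walk_rate = BB / (AB + BB))

# Low walk rates are drawn red, high walk rates dark blue
coloured <- ggplot(recent, aes(x = SO, y = HR, colour = walk_rate)) +
  geom_point(alpha = 0.5) +
  scale_colour_gradient(low = "red", high = "darkblue") +
  labs(x = "Strikeouts", y = "Home runs", colour = "Walk rate")

coloured
```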

A Negative Relationship

A strikeout is, by definition, an at-bat with no hit — so more strikeouts directly pull batting average down. But there is still a lot of scatter around the trend.
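Batting average is hits divided by at-bats, so it has to be computed before it can be compared with strikeouts. A minimal sketch, again assuming seasons from 2000 on with at least 400 at-bats:

```r
library(Lahman)

# Assumptions: seasons from 2000 on, at least 400 at-bats
recent <- subset(Batting, yearID >= 2000 & AB >= 400)

# Batting average = hits / at-bats
recent$AVG <- recent$H / recent$AB

# A negative correlation means batting average falls as strikeouts rise
r_so_avg <- cor(recent$SO, recent$AVG, use = "complete.obs")
r_so_avg

plot(recent$SO, recent$AVG,
     xlab = "Strikeouts", ylab = "Batting average")
```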

Positive vs Negative Relationships

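A trend line that slopes up = positive relationship. A trend line that slopes down = negative relationship. Strength is a separate question: the more tightly the dots hug the trend line, the stronger the relationship. A steep line with a lot of scatter around it is still a weak relationship.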
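Slope and strength can be measured separately: the slope comes from a fitted line, and the strength from the correlation. The sketch below uses simulated data (not baseball data) in which two dot clouds share the same underlying slope of 2 but differ in how much they scatter.

```r
set.seed(1)

# Simulated data: same underlying line y = 2x, different amounts of noise
x <- seq(0, 10, length.out = 200)
tight <- 2 * x + rnorm(200, sd = 1)
loose <- 2 * x + rnorm(200, sd = 10)

# Fitted slopes: both estimate the same underlying slope of 2
slope_tight <- coef(lm(tight ~ x))[["x"]]
slope_loose <- coef(lm(loose ~ x))[["x"]]

# Correlations: the tight cloud is a much stronger relationship
r_tight <- cor(x, tight)
r_loose <- cor(x, loose)

c(slope_tight = slope_tight, slope_loose = slope_loose,
  r_tight = r_tight, r_loose = r_loose)
```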

Summary

Three Plot Types, Three Questions

Histogram

How is one variable spread out?

  • Shows distribution shape
  • Reveals skew and outliers
  • Great for a single variable

Density Plot

How do two groups compare?

  • Smoothed version of histogram
  • Overlapping curves easy to read
  • Great for comparing groups

Scatterplot

How do two variables relate?

  • One dot per observation
  • Reveals direction and strength
  • Add colour for a third variable

What’s Next?

More plot types to explore:

  • Bar charts — compare groups or teams
  • Line plots — track trends over time
  • Box plots — compare distributions across many groups at once

To run these plots yourself:

  1. Install R and RStudio (both free)
  2. Run in the R console:
install.packages(c(
  "Lahman",
  "tidyverse"
))
  3. Open the .qmd file in RStudio and click Render

Questions?

The companion HTML document contains full explanations and all R code.